About the Provider

Alibaba Cloud is the cloud computing arm of Alibaba Group and the creator of the Qwen model family. Through its open-source initiative, Alibaba has released state-of-the-art language and multimodal models under permissive licenses, enabling developers and enterprises to build powerful AI applications across diverse domains and languages.

Model Quickstart

This section helps you quickly get started with the Qwen/Qwen3-VL-235B-A22B-Thinking model on the Qubrid AI inferencing platform. To use this model, you need:
  • A valid Qubrid API key
  • Access to the Qubrid inference API
  • Basic knowledge of making API requests in your preferred language
Once authenticated with your API key, you can send inference requests to the Qwen/Qwen3-VL-235B-A22B-Thinking model and receive responses based on your input prompts. The example below shows how the model can be accessed from Python using the OpenAI-compatible SDK; adapt it to whichever environment best fits your workflow.
from openai import OpenAI

# Initialize the OpenAI client with Qubrid base URL
client = OpenAI(
    base_url="https://platform.qubrid.com/v1",
    api_key="QUBRID_API_KEY",  # replace with your actual Qubrid API key
)

# Create a streaming chat completion
stream = client.chat.completions.create(
    model="Qwen/Qwen3-VL-235B-A22B-Thinking",
    messages=[
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is in this image? Describe the main elements."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ],
    max_tokens=4096,
    temperature=0.7,
    top_p=0.9,
    stream=True
)

# With stream=True, print tokens as they arrive
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

# With stream=False, the call returns a single completion instead;
# read it with:
#   print(response.choices[0].message.content)
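The quickstart above passes a public image URL. OpenAI-compatible vision endpoints also commonly accept images inlined as base64 data URLs; assuming the Qubrid endpoint follows that convention, a local image file can be prepared like this (the helper names below are illustrative, not part of any SDK):

```python
import base64
import mimetypes

def to_data_url(path: str) -> str:
    """Read a local image file and return it as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime or 'image/jpeg'};base64,{encoded}"

def image_message(prompt: str, image_path: str) -> dict:
    """Build a user message with the same structure as the quickstart,
    but with a local image inlined instead of a remote URL."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": to_data_url(image_path)}},
        ],
    }
```

Pass the result as `messages=[image_message("Describe this image.", "photo.jpg")]` to `client.chat.completions.create`, exactly as in the quickstart.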

Model Overview

Qwen3-VL-235B-A22B-Thinking is the most powerful vision-language model in the Qwen series.
  • With 235B total parameters and 22B active per token, it excels in multimodal STEM and math reasoning, visual agent tasks, GUI automation, spatial perception, long video comprehension, and multilingual OCR across 32 languages.
  • Its thinking mode enables deep chain-of-thought reasoning over complex visual inputs, with a 256K native context window expandable to 1M tokens.

Model at a Glance

| Feature | Details |
| --- | --- |
| Model ID | Qwen/Qwen3-VL-235B-A22B-Thinking |
| Provider | Alibaba Cloud (Qwen Team) |
| Architecture | Sparse MoE Transformer with DeepStack multi-level ViT feature fusion and Interleaved-MRoPE for video temporal reasoning |
| Model Size | 235B total / 22B active |
| Context Length | 256K tokens (up to 1M) |
| Release Date | 2025 |
| License | Apache 2.0 |
| Training Data | Large-scale multimodal dataset across 32 languages; RL post-training with thinking mode for deep reasoning |

When to use?

You should consider using Qwen3-VL-235B-A22B-Thinking if:
  • You need visual STEM and math reasoning with deep chain-of-thought
  • Your application requires GUI automation or visual agent tasks
  • Your use case involves multimodal coding from images or video
  • You need long video understanding and temporal reasoning
  • Your workflow requires multilingual OCR across 32 languages
  • You need 3D grounding and spatial reasoning over visual inputs

Inference Parameters

| Parameter Name | Type | Default | Description |
| --- | --- | --- | --- |
| Streaming | boolean | true | Enable streaming responses for real-time output. |
| Temperature | number | 0.7 | Controls randomness in output. |
| Max Tokens | number | 4096 | Maximum number of tokens to generate. |
| Top P | number | 0.9 | Controls nucleus sampling. |
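The defaults in the table above can be gathered into a small helper so that every request starts from the documented values and overrides stay explicit; the `build_request` function below is an illustrative sketch, not part of the Qubrid SDK:

```python
def build_request(prompt: str, **overrides) -> dict:
    """Assemble chat-completion kwargs from the documented defaults.

    Keyword arguments override any default, e.g. temperature=0.2
    for more deterministic output.
    """
    params = {
        "model": "Qwen/Qwen3-VL-235B-A22B-Thinking",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 4096,   # Max Tokens default
        "temperature": 0.7,   # Temperature default
        "top_p": 0.9,         # Top P default
        "stream": True,       # Streaming default
    }
    params.update(overrides)
    return params
```

With a client configured as in the quickstart, a call then becomes `client.chat.completions.create(**build_request("Hello"))`.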

Key Features

  • Thinking Mode: Built-in chain-of-thought reasoning for deep multimodal problem solving across STEM, math, and visual tasks.
  • DeepStack Multi-Level ViT Fusion: Multi-level visual feature fusion for fine-grained image and document understanding.
  • Interleaved-MRoPE: Advanced positional encoding for precise video temporal reasoning across long sequences.
  • 256K Native Context: Supports up to 1M tokens — enabling long video comprehension and large document analysis.
  • Rivals Gemini 2.5 Pro: Competitive on perception and multimodal reasoning benchmarks at open-weight scale.
  • Multilingual OCR: Accurate text recognition across 32 languages in images and documents.
  • Apache 2.0 License: Fully open source with full commercial freedom.
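Qwen thinking-mode models typically wrap their chain-of-thought in `<think>...</think>` tags ahead of the final answer. Assuming the Qubrid endpoint returns the raw text in that form (verify against actual responses), the two parts can be separated with a helper like this:

```python
def split_thinking(text: str) -> tuple[str, str]:
    """Split a response into (reasoning, answer).

    Assumes the model's chain-of-thought is delimited by
    <think>...</think> tags; if no such block is present,
    the whole text is treated as the answer.
    """
    start, end = "<think>", "</think>"
    if start in text and end in text:
        reasoning = text.split(start, 1)[1].split(end, 1)[0].strip()
        answer = text.split(end, 1)[1].strip()
        return reasoning, answer
    return "", text.strip()
```

This lets an application log or display the reasoning separately while showing users only the final answer.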

Summary

Qwen3-VL-235B-A22B-Thinking is the flagship vision-language model of the Qwen series, built for deep multimodal reasoning.
  • It uses a Sparse MoE Transformer with DeepStack ViT fusion and Interleaved-MRoPE, with 235B total and 22B active parameters per token.
  • It rivals Gemini 2.5 Pro on perception benchmarks and leads in GUI automation, visual STEM reasoning, and multilingual OCR.
  • The model supports 256K native context (up to 1M), thinking mode for chain-of-thought reasoning, and 32 languages.
  • Licensed under Apache 2.0 for full commercial use.